import pandas as pd
import re
pd.options.display.float_format = '{:,.2f}'.format
from sklearn.metrics import auc, roc_auc_score, log_loss
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import plotly.express as px
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
nltk.download('stopwords')
from nltk.corpus import stopwords
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
from xgboost import XGBClassifier, plot_importance
import warnings
from sklearn.metrics import classification_report, confusion_matrix, RocCurveDisplay
warnings.filterwarnings('ignore') #hide the warnings
The data consists of information about projects on Kickstarter, a crowdfunding platform. Data link
Our main task is to predict whether a project will be successful before it is launched
df = pd.read_csv('ks-projects-201801.csv')
The most important step is to get to know your data. You need to inspect the data and its properties: shape, data types, descriptive statistics, etc.
.info() shows the shape and the data types, along with whether each column contains null values. The dataset comprises 378,661 observations and 15 features.
df.info(show_counts=True) # show_counts displays the non-null count of each column ('null_counts' is deprecated)
To take a closer look at the data, I recommend using .head(), which returns the first five rows of the data set.
Similarly, .tail() returns the last five rows.
df.head()
In this stage, we perform initial investigations on the data to discover multicollinearity, recognize patterns and spot anomalies, using statistical graphics and data visualization methods.
This step is varied and may be extensive. I chose to present several major methods, but you can dive deeper into the data analysis.
df['ID'] = df['ID'].astype(str)
You can generate descriptive statistics using the describe() function. It returns the count, mean, standard deviation, minimum and maximum values, and the quantiles of the data.
Notice that the mean is higher than the median (50%) in each column. Moreover, there is a notably large gap between the 75th percentile and the maximum of each column. These observations suggest that there are outliers in our data set.
df.describe()
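As a quick numeric follow-up to the outliers that describe() hints at, here is a minimal IQR-based sketch. The toy series and the 1.5×IQR rule are illustrative, not part of the original analysis:

```python
import pandas as pd

def count_iqr_outliers(s: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())

# One obvious extreme value in a toy series
s = pd.Series([1, 2, 3, 4, 5, 1000])
print(count_iqr_outliers(s))  # → 1
```

On the real data you could apply this per column, e.g. `count_iqr_outliers(df['usd_pledged_real'])`, to see how many rows the classic boxplot rule would flag.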
Our task is to predict whether a project will succeed or not, so we should transform the "state" column to binary. First, we look at the counts of the unique values of "state"; then we define any state other than "successful" as "failed".
The value_counts() function shows the relative frequency of each state value in descending order. We observe that "failed" and "successful" are the most common states, with only a few observations for "canceled", "live" and "suspended".
df['state'].value_counts(normalize=True)
df['binary_state'] = df['state'].apply(lambda x: 1 if x =='successful' else 0)
Here, we compute the pairwise correlation of the numeric columns in order to find features highly correlated with the target, as well as multicollinearity.
With the 'Blues' colormap, darker cells indicate stronger positive correlation and lighter cells indicate weaker or negative correlation.
We can infer that "usd_pledged" and "pledged" have a strong positive correlation with "usd_pledged_real". Similarly, "goal" and "usd_goal_real" are positively correlated. Logically, "pledged" also correlates strongly with "backers".
We should take these correlations into account in the feature selection stage.
sns.heatmap(df.corr(numeric_only=True), cmap='Blues') # numeric_only avoids errors on the non-numeric columns
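To turn the heatmap reading into something programmatic, here is a small sketch that lists pairs of numeric columns whose absolute correlation exceeds a threshold. The toy DataFrame and the 0.8 cutoff are illustrative:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, |r|) for numeric pairs with |r| >= threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(float(upper.loc[a, b]), 2))
            for a, b in upper.stack().index if upper.loc[a, b] >= threshold]

# Toy example: y is a near-copy of x, z is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({'x': x,
                    'y': x + rng.normal(scale=0.01, size=200),
                    'z': rng.normal(size=200)})
print(high_corr_pairs(toy))  # → [('x', 'y', 1.0)]
```

Running the same function on `df` would surface the pledged/goal pairs mentioned above as drop candidates.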
Display histograms of all the continuous features. A histogram lets us see the distribution of a particular variable.
df.hist(bins=20,figsize=(25, 20),)
A box plot shows the distribution of quantitative data. It can reveal outliers and how the data is skewed. More information about box plots can be found here
sns.boxplot(x= 'binary_state',y = 'usd_pledged_real', data=df)
Since our data is highly skewed, we can use a log scale on the y-axis to better visualize it.
Here, we can observe that "usd_pledged_real" is distributed differently between successful and failed projects.
px.box(x= 'binary_state',y = 'usd_pledged_real', data_frame=df,log_y=True)
As the average goal of "failed" projects is higher than that of "successful" projects, we can infer that a project may fail when its goal is set too high.
px.box(x= 'binary_state',y = 'usd_goal_real', data_frame=df,log_y=True)
Here I use a bar plot to show the success rate per project category.
The highest success rates belong to the "Dance" and "Theater" categories.
plt.xticks(rotation='90')
sns.barplot(x='main_category', y= 'binary_state', data=df)
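The success rates behind this bar plot can also be computed directly with a groupby; a minimal sketch on stand-in data (the toy frame mimics `df[['main_category', 'binary_state']]`):

```python
import pandas as pd

# Toy frame standing in for the real df
toy = pd.DataFrame({
    'main_category': ['Dance', 'Dance', 'Theater', 'Film', 'Film', 'Film'],
    'binary_state':  [1,       1,       1,         0,      1,      0],
})

# Mean of a 0/1 column per group == success rate per category
success_rate = (toy.groupby('main_category')['binary_state']
                   .mean()
                   .sort_values(ascending=False))
print(success_rate)
```

On the real data, `df.groupby('main_category')['binary_state'].mean()` gives the exact numbers that `sns.barplot` estimates with its confidence intervals.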
Word Clouds are visual representations of words that give greater prominence to words that appear more frequently.
Here I use it to find the most common words in the Kickstarter projects names.
# Join all project names into one string; WordCloud filters its STOPWORDS for us
words = " ".join(df['name'].dropna().str.lower())
wordcloud = WordCloud(background_color='white', stopwords=STOPWORDS).generate(words)
plt.figure(figsize = (10, 10), facecolor = None)
plt.imshow(wordcloud)
plt.axis('off')
Handling missing data is very important during preprocessing, as many machine learning algorithms do not support missing values.
First, we look at the samples with NaN values and try to find reasons for the missing data.
Here, we have 3,801 samples with a NaN value in at least one column. We can observe that NaN values tend to co-occur with the "undefined" state and the 'N,0"' country value.
# Display rows with one or more NaN values
df[df.isna().any(axis=1)]
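A compact way to quantify the missing data per column is `isna().sum()`; a sketch on a toy frame standing in for `df` (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with NaNs in two columns of the same row
toy = pd.DataFrame({'name': ['A', None, 'C'],
                    'usd pledged': [10.0, np.nan, 5.0],
                    'state': ['failed', 'undefined', 'successful']})

print(toy.isna().sum().sort_values(ascending=False))        # NaN count per column
print(f"rows with any NaN: {toy.isna().any(axis=1).sum()}") # → rows with any NaN: 1
```

On the real data, `df.isna().any(axis=1).sum()` reproduces the 3,801 figure quoted above.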
I chose to present two ways to handle missing values:
# remove all the rows that contain a missing value
df = df.dropna()
# replace all NA's with 0
df.fillna(0, inplace=False) # if you want to update the df, you can use 'inplace=True'
In this stage we extract features from the raw data. Our goal is to improve the performance of machine learning models by providing them with insightful features.
Although date columns usually carry valuable information about the target, algorithms struggle with their raw format, so preprocessing them is very important.
# Transform string to date
df['deadline'] = pd.to_datetime(df['deadline'])
df['launched'] = pd.to_datetime(df['launched'])
# extract parts of the date
def time_features(df, column):
    df[f'weekday_{column}'] = df[column].dt.dayofweek
    df[f'monthday_{column}'] = df[column].dt.day
    df[f'month_{column}'] = df[column].dt.month
    df[f'year_{column}'] = df[column].dt.year
    return df
df = time_features(df, 'deadline')
df = time_features(df, 'launched')
df['hour_launched'] = df['launched'].dt.hour
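One additional date feature worth considering (not in the original feature list) is the campaign duration, i.e., deadline minus launch time; a sketch on toy dates:

```python
import pandas as pd

# Toy launch/deadline pairs standing in for df['launched'] / df['deadline']
toy = pd.DataFrame({
    'launched': pd.to_datetime(['2017-01-01 10:00', '2017-03-15 08:30']),
    'deadline': pd.to_datetime(['2017-01-31', '2017-04-14']),
})
# Subtracting datetime columns yields a Timedelta; .dt.days gives whole days
toy['duration_days'] = (toy['deadline'] - toy['launched']).dt.days
print(toy['duration_days'].tolist())  # → [29, 29]
```

The same one-liner, `(df['deadline'] - df['launched']).dt.days`, would add the feature to the real frame.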
I like to visualise the mean of the target variable over weekdays and hours.
Here I chose to examine the success rate across the project launch time. The hourly "success" pattern looks fairly similar every day; however, there are a couple of peak hours, e.g., 6, 9 and 15.
plt.figure(figsize = (10, 7), facecolor = None)
plt.xticks(df['hour_launched'].unique())
sns.lineplot(x='hour_launched', y='binary_state', data=df , hue='weekday_deadline', err_style=None, palette='tab10')
Prepare the raw text data to make it suitable for a machine learning model; this includes text cleaning, stopword removal, etc.
stop = stopwords.words('english')
def avg_word(sentence):
    words = sentence.split()
    return sum(len(word) for word in words) / len(words)
df['count_words'] = df['name'].apply(lambda text: len(tokenizer.tokenize(text))) # number of \w+ tokens in the text
df['stopwords'] = df['name'].apply(lambda x: len([w for w in x.split() if w.lower() in stop])) # number of stop words in the text
df['avg_word'] = df['name'].apply(avg_word) # average word length: total character count divided by word count
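To make the three text features concrete, here is what they compute for one sample title. `re.findall(r'\w+', …)` stands in for `RegexpTokenizer(r'\w+')`, and the tiny stopword set replaces NLTK's English list for the sake of a self-contained example:

```python
import re

# Tiny stand-in for NLTK's English stopword list (illustrative only)
stop = {'the', 'of', 'a'}

name = "The Art of the Short Film"
tokens = re.findall(r'\w+', name)  # same pattern as the RegexpTokenizer above
count_words = len(tokens)
stopword_count = len([w for w in name.split() if w.lower() in stop])
avg_word = sum(len(w) for w in name.split()) / len(name.split())

print(count_words, stopword_count, round(avg_word, 2))  # → 6 3 3.33
```

So this title contributes count_words = 6, stopwords = 3, and an average word length of about 3.33 characters.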
Select features based on the insights from the EDA and features engineering stages.
num_features = ['usd_goal_real','count_words','stopwords', 'avg_word','weekday_deadline', 'monthday_deadline', 'month_deadline',
'year_deadline', 'weekday_launched', 'monthday_launched',
'month_launched', 'year_launched', 'hour_launched']
cat_features = ['category','main_category','country']
features=num_features+cat_features
# Split into train and test
train, test = train_test_split(df,random_state=42,test_size=0.3,)
y_train=train['binary_state']
y_test=test['binary_state']
X_train= train.filter(features, axis=1)
X_test= test.filter(features, axis=1)
Most machine learning algorithms can't work with categorical data directly, so categorical data must be converted to a numerical form. There are various categorical encoding methods, e.g., Label Encoding and Target Encoding.
Here I use One-Hot Encoding. This method spreads the values of a categorical column into multiple binary columns, one per category, where each binary value indicates whether that category is present in the row.
X_train= pd.get_dummies(X_train)
X_test= pd.get_dummies(X_test)
# Keep only the dummy columns that appear in both the train and test sets
X_train, X_test = X_train.align(X_test, join='inner', axis=1)
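A toy example of what get_dummies followed by align(join='inner') does. The country values are illustrative; note that the inner join drops not only test-only categories but train-only ones as well:

```python
import pandas as pd

tr = pd.DataFrame({'country': ['US', 'GB', 'US']})
te = pd.DataFrame({'country': ['US', 'DE']})  # 'DE' never appears in train

tr_d = pd.get_dummies(tr)  # columns: country_GB, country_US
te_d = pd.get_dummies(te)  # columns: country_DE, country_US
tr_d, te_d = tr_d.align(te_d, join='inner', axis=1)

print(list(tr_d.columns))  # → ['country_US']
```

If you would rather keep every train column and zero-fill the missing ones in test, `te_d.reindex(columns=tr_d.columns, fill_value=0)` is a common alternative to the inner join.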
In most cases, the numerical features span varying ranges and units. This can be a significant obstacle for several machine learning algorithms, such as gradient-descent-based and distance-based methods.
Normalization and Standardization solve this problem.
Here I chose min-max normalization, but you can use standardization methods like StandardScaler, depending on your task and the machine learning algorithm you are using.
for f in num_features:
    scaler = MinMaxScaler()
    X_train[f] = scaler.fit_transform(X_train[f].values.reshape(-1, 1))
    X_test[f] = scaler.transform(X_test[f].values.reshape(-1, 1))
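The fit-on-train / transform-test discipline matters. A hand-computed sketch of the same min-max formula (mirroring what MinMaxScaler does, on made-up numbers) shows that unseen test extremes can legitimately fall outside [0, 1]:

```python
import numpy as np

train_col = np.array([1.0, 5.0, 9.0])
test_col = np.array([0.0, 5.0, 11.0])  # contains values outside the train range

# Fit statistics come from the training data only, to avoid test-set leakage
lo, hi = train_col.min(), train_col.max()
train_scaled = (train_col - lo) / (hi - lo)
test_scaled = (test_col - lo) / (hi - lo)

print(train_scaled)  # train maps exactly onto [0, 1]
print(test_scaled)   # test values beyond the train min/max fall outside [0, 1]
```

This is why the loop above calls `fit_transform` on X_train but only `transform` on X_test.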
In this section we train the model on our data and evaluate the performances.
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_probs = xgb.predict_proba(X_test)[:,1]
logloss = log_loss(y_test, y_probs)
roc_auc = roc_auc_score(y_test, y_probs)
print(f'auc: {roc_auc:0.2f} , log-loss: {logloss:0.2f}')
y_pred = xgb.predict(X_test)
print(classification_report(y_test, y_pred))
labels= sorted(list(set(y_train)))
cm= confusion_matrix(y_test,y_pred,labels=labels)
ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax,cmap="BuGn",fmt='d'); #annot=True to annotate cells
# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
from sklearn.metrics import RocCurveDisplay # plot_roc_curve was removed in scikit-learn 1.2
RocCurveDisplay.from_estimator(xgb, X_test, y_test)
In many business cases, it is equally important to have an accurate and an interpretable model. That is, we want to know which features matter most for the forecast. Feature importance analysis helps us interpret the model.
feature_important = xgb.get_booster().get_score(importance_type='gain')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False).head(20).sort_values(by="score") # take the top 20 by gain; the final ascending sort puts the most important at the top of the barh plot
data.plot(kind='barh')